A Comparative Study for WordNet Guided Text Representation

نویسندگان

  • Jian Zhang
  • Chunping Li
چکیده

Text information processing depends critically on the proper text representation. A common and naïve way of representing a document is a bag of its component words [1], but the semantic relations between words are ignored, such as synonymy and hypernymy-hyponymy between nouns. This paper presents a model for representing a document in terms of the synonymy sets (synsets) in WordNet [2]. The synsets stand for concepts corresponding to the words of the document. The Vector Space Model describes a document as orthogonal term vectors. We replace terms with concepts to build Concept Vector Space Model (CVSM) for the training set. Our experiments on the Reuters Corpus Volume I (RCV1) dataset have shown that the result is satisfactory.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Representation of textual documents by the approach wordnet and n-grams for the unsupervised classification (clustering) with 2D cellular automata: a comparative study

In this article we present a 2D cellular automaton (Class_AC) to solve a problem of text mining in the case of unsupervised classification (clustering). Before to experiment the cellular automaton, we vectorized our data indexing textual documents from the database REUTERS 21,578 by Wordnet approach and the representation of text documents by the method n-grams. Our work is to make a comparativ...

متن کامل

A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm For Text Clustering Using Nepali Wordnet

The volume of digitized text documents on the web have been increasing rapidly. As there is huge collection of data on the web there is a need for grouping(clustering) the documents into clusters for speedy information retrieval. Clustering of documents is collection of documents into groups such that the documents within each group are similar to each other and not to documents of other groups...

متن کامل

The Use of WordNets for Multilingual Text Categorization: A Comparative Study

The successful use of the Princeton WordNet for Text Categorization has prompted the creation of similar WordNets in other languages as well. This paper focuses on a comparative study between two WordNet based approaches for Multilingual Text Categorization. The first relates on using machine translation to access directly the princeton WordNet while the second avoids machine translation by usi...

متن کامل

Arabic text categorization: a comparative study of different representation modes

The quantity of accessible information on Internet is phenomenal, and its categorization remains one of the most important problems. A lot of work is currently focused on English rightly since; it is the dominant language of the Web. However, a need arises for the other languages, because the Web is each day more multilingual. The need is much more pressing for the Arabic language. Our research...

متن کامل

Text Classification Using WordNet Hypernyms

This paper describes experiments in Machine Learning for text classification using a new representation of text based on WordNet hypernyms. Six binary classification tasks of varying difficulty are defined, and the Ripper system is used to produce discrimination rules for each task using the new hypernym density representation. Rules are also produced with the commonly used bag-of-words represe...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005